Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode) by Ankur-singh · Pull Request #1933 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-25T17:37:37Z

Adds the minimaxm3-fp4-b200-vllm-mtp config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc) with EAGLE3 speculative decoding (spec-decoding: mtp, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3).

Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc), every search-space row spec-decoding: mtp; sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512.
Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --speculative-config EAGLE3; --block-size 128 (MSA), --language-model-only; prompts routed through the chat template.
Weights: target pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh. The EAGLE3 draft (not staged) is fetched next to the target weights, as in the fp8 b200 MTP recipe.
perf-changelog entry appended.

… decode New minimaxm3-fp4-b200-vllm-mtp config (fp4 vLLM aggregated on b200-dgxc with EAGLE3 speculative decoding, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve and routes prompts through the chat template. Target weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); the EAGLE3 draft is fetched next to the target weights.

claude · 2026-06-25T17:44:04Z

+VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
+for f in \
+  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
+  model_executor/layers/quantization/modelopt.py \
+  model_executor/layers/quantization/utils/flashinfer_utils.py
+do
+  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"


🟡 The patch step (lines 26-34) downloads three vLLM source files via curl and runs a Python import as a sanity check, but neither step has a fail-exit guard and the script does not set -e (nor does benchmark_lib.sh). If a download fails (network blip, GitHub raw outage, commit moved) or the import fails on a half-patched install, the script proceeds to vllm serve with broken/partially-patched code, producing a confusing serve-time error instead of a clean setup failure. Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern: wrap the curl loop and the verify in || { echo "nvfp4 patch failed" >&2; exit 1; }.

Extended reasoning...

What the bug is

benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh:26-34 overlays three files from vllm-project/vllm PR #46380 (commit 6c08558) onto the installed vLLM package, then runs a single-import sanity check:

for f in \ model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \ model_executor/layers/quantization/modelopt.py \ model_executor/layers/quantization/utils/flashinfer_utils.py do curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" done python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

Neither the loop body nor the python3 verification has a || exit 1 guard, and there is no set -e either in this script or inherited from benchmark_lib.sh (the only set -e in benchmark_lib.sh is locally scoped inside one helper function, not at file level).

Code path that triggers it

curl -fsSL returns non-zero on HTTP error (e.g. raw.githubusercontent.com 5xx, the commit getting rewritten, the file getting renamed, or a network blip).

Without set -e, the failed curl invocation does not abort the loop — it just moves on to the next iteration, leaving a stale or missing file behind.

The python3 -c line then imports exactly one of the three patched modules (TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe). If that one import succeeds but modelopt.py or flashinfer_utils.py was not refreshed, the sanity check still prints [nvfp4-patch] OK.

Even when the import itself fails, python3 exits non-zero but the script proceeds straight to vllm serve.

Why existing code does not prevent it

set -e is not enabled at the script level (only set -x is set later, which is shell trace, not error-exit).

benchmark_lib.sh does not set -e globally; the only set -e is local to a helper function (around line 1270).

curl -f only affects curl's own exit code; it does not propagate to the shell unless combined with an exit guard.

Impact

A transient GH raw outage, a force-pushed commit, or a renamed file produces a confusing crash deep inside vllm serve (e.g. "unrecognised quantization config" or an ImportError from inside the engine) instead of a clean [nvfp4-patch] failed at script startup. Diagnosis costs an extra round-trip through engine logs for what is fundamentally a setup-time download failure.

How to fix

Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern:

for f in \ model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \ model_executor/layers/quantization/modelopt.py \ model_executor/layers/quantization/utils/flashinfer_utils.py do curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" \ || { echo "[nvfp4-patch] curl failed for ${f}" >&2; exit 1; } done python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')" \ || { echo "[nvfp4-patch] import verify failed" >&2; exit 1; }

Step-by-step proof

Concrete failure walk-through: suppose raw.githubusercontent.com returns a 503 for modelopt.py (transient outage):

Iteration 1: curl for trtllm_nvfp4_moe.py succeeds.

Iteration 2: curl for modelopt.py exits non-zero, but with no set -e and no || exit, the shell loop advances. The stale (pre-patch) modelopt.py from the base image remains on disk.

Iteration 3: curl for flashinfer_utils.py succeeds.

The verify line imports TrtLlmNvFp4ExpertsModular — which was successfully overwritten — and prints [nvfp4-patch] OK. The half-patched state goes undetected.

vllm serve starts, hits the unpatched modelopt.py when loading the NVFP4 quant config, and crashes with an opaque engine error.

With the fix, step 2 immediately prints [nvfp4-patch] curl failed for model_executor/layers/quantization/modelopt.py and exits 1 — a one-line setup failure instead of an engine-internal crash. Severity is nit because the script still ultimately errors; the failure is just less localised and slower to diagnose.

claude · 2026-06-25T17:44:04Z

+    - "Image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223; benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; prompts routed through the chat template"
+    - "Target weights pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); EAGLE3 draft fetched next to the target weights; --block-size 128 (MSA), --language-model-only"
+    - "Sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX


🟡 The new perf-changelog entry at perf-changelog.yaml:4194 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the XXX is a literal placeholder that was never substituted. Every other entry in this file uses the real PR number, so this will ship as a broken link in the released changelog. Trivial fix: replace XXX with 1933 before merge.

Extended reasoning...

What the bug is: The perf-changelog.yaml entry added by this PR (lines 4187-4194) ends with:\n\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n\n\nThe XXX is a literal three-character placeholder that should have been substituted with the actual PR number before the change was pushed. This PR is #1933 (per the PR metadata).\n\nWhy this is a defect: perf-changelog.yaml is a structured, machine- and human-readable record of perf changes. Every other entry in the file uses the actual numeric PR id, e.g. line 4185 ends with pull/1927, line 4176 with pull/1762, and earlier entries reference 1865, 1706, etc. Anyone (or anything) consuming this file — release notes generators, the GitHub UI when the markdown is rendered, humans clicking the link — will land on https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is not a valid PR. GitHub will return a 404.\n\nWhy existing code doesn't prevent it: There is no schema validation on pr-link that enforces a numeric trailing component (the value is just a string), and the PR was opened from a branch where the author hand-edited the changelog entry from a template and forgot to fill in the placeholder. CI passing doesn't catch this kind of metadata typo.\n\nImpact: Once merged, the changelog will contain a permanently broken link. The fix is more disruptive after merge (requires a follow-up PR) than now (one character change to the diff).\n\nStep-by-step proof:\n1. Open perf-changelog.yaml after this PR merges.\n2. Scroll to the bottom (lines 4187-4194), the new minimaxm3-fp4-b200-vllm-mtp entry.\n3. Read pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.\n4. Click / curl that URL → GitHub returns 404 because XXX is not a valid PR number.\n5. Compare against the immediately preceding entry on line 4185, pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1927 — that one resolves correctly.\n\nHow to fix: In the diff, change line 4194 from:\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n\nto:\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1933\n

…lm-mtp # Conflicts: # perf-changelog.yaml

github-actions · 2026-06-25T17:54:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28189601386
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28189601386

functionstackx

fix patchwork as discussed in slack
missing vllm recipes

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the EAGLE3 benchmark script no longer overwrites vLLM files at runtime.

…lm-mtp # Conflicts: # perf-changelog.yaml

github-actions · 2026-06-26T04:52:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28197848860
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28197848860

github-actions · 2026-06-26T17:33:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28197848860
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28197848860

Ankur-singh requested a review from a team June 25, 2026 17:37

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners June 25, 2026 17:37

github-project-automation Bot added this to InferenceMAX Board Jun 25, 2026

Update perf-changelog pr-link for #1933

87930c9

claude Bot reviewed Jun 25, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vl…

d9c614d

…lm-mtp # Conflicts: # perf-changelog.yaml

Ankur-singh added the full-sweep-enabled label Jun 25, 2026

functionstackx requested changes Jun 25, 2026

View reviewed changes

Ankur-singh added 2 commits June 25, 2026 13:17

Drop runtime NVFP4 patch; bump perf image to ...-8b00f41

c32cb40

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the EAGLE3 benchmark script no longer overwrites vLLM files at runtime.

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vl…

960730c

…lm-mtp # Conflicts: # perf-changelog.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode)#1933

Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode)#1933
Ankur-singh wants to merge 5 commits into
mainfrom
minimaxm3-fp4-b200-vllm-mtp

Ankur-singh commented Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

functionstackx left a comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Ankur-singh commented Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

What the bug is

Code path that triggers it

Why existing code does not prevent it

Impact

How to fix

Step-by-step proof

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

functionstackx left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants